Video understanding platform

ABSTRACT

In one embodiment, a method includes accessing a video-content object, determining a first feature vector representing the video-content object using a first recognition module of a first type based on an object in the video-content object, and determining a second feature vector representing the video-content object using a second recognition module of a second type based on the first feature vector. The first type is different from the second type. The method also includes determining a context of the video-content object based on the second feature vector.

PRIORITY

This application is a continuation under 35 U.S.C. § 120 of U.S. patent application Ser. No. 15/395,511, filed 30 Dec. 2016.

TECHNICAL FIELD

This disclosure generally relates to computer vision.

BACKGROUND

Computer vision is a computational process (or set of computational processes) that facilitates machine understanding of the content of an image or set of images, such as a video. For example, computer vision may involve automatically extracting features from an image, analyzing them, and generating an explicit description or categorization of the image. Applications of computer vision include controlling processes and systems, navigation, event detection, organizing information, modeling objects or environments, and automatic inspection.

A social-networking system, which may include a social-networking website, may enable its users (such as persons or organizations) to interact with it and with each other through it. The social-networking system may, with input from a user, create and store in the social-networking system a user profile associated with the user. The user profile may include demographic information, communication-channel information, and information on personal interests of the user. The social-networking system may also, with input from a user, create and store a record of relationships of the user with other users of the social-networking system, as well as provide services (e.g., wall posts, photo-sharing, event organization, messaging, games, or advertisements) to facilitate social interaction between or among users.

SUMMARY OF PARTICULAR EMBODIMENTS

In particular embodiments, a video understanding platform may be trained by machine learning to make a prediction about a video-content object based on one or more of: frames of the video-content object, audio of the video-content object, and text associated with the video-content object. In particular embodiments, a video understanding platform may comprise a video-recognition model, an audio-recognition model, and a text-recognition model. A video-recognition model may be trained by machine learning to make a prediction about a video-content object based on an analysis of one or more frames (e.g., a still image) of the video-content object. An audio-recognition model may be trained by machine learning to make a prediction about a video-content object based on an analysis of part or all of the audio of a video-content object (e.g., speech identification, language identification, sound identification, source separation, etc.). A text-recognition model may be trained by machine learning to make a prediction about a video-content object based on text associated with the video-content object (e.g., posts or comments associated with a video-content object posted on an online social network, text metadata associated with the video-content object, topic classification information associated with the video-content object, intent understanding information associated with the video-content object, etc.). In particular embodiments, a prediction about a video-content object may comprise a context, a predicted future action, a predicted object, a predicted motion, or any other suitable prediction. A context of a video-content object may be one or more n-grams that describe the video-content object or an aspect of the video-content object (e.g., a description of objects or actions depicted, a category of the video-content object, etc.). In particular embodiments, a computer-vision platform may update a prediction about a video-content object based on information not used to make a prior prediction (e.g., information received after the prior prediction was made). As an example and not by way of limitation, a video-content object may be a video that is streamed live and information (e.g., likes, comments, shares, video content, etc.) may be received in an ongoing manner, and the computer-vision platform may update a prediction based on this information. Although this disclosure may describe a particular video understanding platform, this disclosure contemplates any suitable video understanding platform.

The embodiments disclosed herein are only examples, and the scope of this disclosure is not limited to them. Particular embodiments may include all, some, or none of the components, elements, features, functions, operations, or steps of the embodiments disclosed above. Embodiments according to the invention are in particular disclosed in the attached claims directed to a method, a storage medium, a system, and a computer program product, wherein any feature mentioned in one claim category, e.g., method, can be claimed in another claim category, e.g., system, as well. The dependencies or references back in the attached claims are chosen for formal reasons only. However, any subject matter resulting from a deliberate reference back to any previous claims (in particular multiple dependencies) can be claimed as well, so that any combination of claims and the features thereof is disclosed and can be claimed regardless of the dependencies chosen in the attached claims. The subject matter which can be claimed comprises not only the combinations of features as set out in the attached claims but also any other combination of features in the claims, wherein each feature mentioned in the claims can be combined with any other feature or combination of other features in the claims. Furthermore, any of the embodiments and features described or depicted herein can be claimed in a separate claim and/or in any combination with any embodiment or feature described or depicted herein or with any of the features of the attached claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example network environment associated with a social-networking system.

FIG. 2 illustrates an example social graph.

FIG. 3 illustrates an example view of a vector space.

FIG. 4 illustrates an example video understanding engine.

FIG. 5 illustrates an example method for determining a context of a video-content object.

FIG. 6 illustrates an example computer system.

DESCRIPTION OF EXAMPLE EMBODIMENTS

FIG. 1 illustrates an example network environment 100 associated with a social-networking system. Network environment 100 includes a user 101, a client system 130, a social-networking system 160, and a third-party system 170 connected to each other by a network 110. Although FIG. 1 illustrates a particular arrangement of user 101, client system 130, social-networking system 160, third-party system 170, and network 110, this disclosure contemplates any suitable arrangement of user 101, client system 130, social-networking system 160, third-party system 170, and network 110. As an example and not by way of limitation, two or more of client system 130, social-networking system 160, and third-party system 170 may be connected to each other directly, bypassing network 110. As another example, two or more of client system 130, social-networking system 160, and third-party system 170 may be physically or logically co-located with each other in whole or in part. Moreover, although FIG. 1 illustrates a particular number of users 101, client systems 130, social-networking systems 160, third-party systems 170, and networks 110, this disclosure contemplates any suitable number of users 101, client systems 130, social-networking systems 160, third-party systems 170, and networks 110. As an example and not by way of limitation, network environment 100 may include multiple users 101, client systems 130, social-networking systems 160, third-party systems 170, and networks 110.

In particular embodiments, user 101 may be an individual (human user), an entity (e.g., an enterprise, business, or third-party application), or a group (e.g., of individuals or entities) that interacts or communicates with or over social-networking system 160. In particular embodiments, social-networking system 160 may be a network-addressable computing system hosting an online social network. Social-networking system 160 may generate, store, receive, and send social-networking data, such as, for example, user-profile data, concept-profile data, social-graph information, or other suitable data related to the online social network. Social-networking system 160 may be accessed by the other components of network environment 100 either directly or via network 110. In particular embodiments, social-networking system 160 may include an authorization server (or other suitable component(s)) that allows users 101 to opt in to or opt out of having their actions logged by social-networking system 160 or shared with other systems (e.g., third-party systems 170), for example, by setting appropriate privacy settings. A privacy setting of a user may determine what information associated with the user may be logged, how information associated with the user may be logged, when information associated with the user may be logged, who may log information associated with the user, whom information associated with the user may be shared with, and for what purposes information associated with the user may be logged or shared. Authorization servers may be used to enforce one or more privacy settings of the users of social-networking system 160 through blocking, data hashing, anonymization, or other suitable techniques as appropriate. Third-party system 170 may be accessed by the other components of network environment 100 either directly or via network 110. In particular embodiments, one or more users 101 may use one or more client systems 130 to access, send data to, and receive data from social-networking system 160 or third-party system 170. Client system 130 may access social-networking system 160 or third-party system 170 directly, via network 110, or via a third-party system. As an example and not by way of limitation, client system 130 may access third-party system 170 via social-networking system 160. Client system 130 may be any suitable computing device, such as, for example, a personal computer, a laptop computer, a cellular telephone, a smartphone, a tablet computer, or an augmented/virtual reality device.

This disclosure contemplates any suitable network 110. As an example and not by way of limitation, one or more portions of network 110 may include an ad hoc network, an intranet, an extranet, a virtual private network (VPN), a local area network (LAN), a wireless LAN (WLAN), a wide area network (WAN), a wireless WAN (WWAN), a metropolitan area network (MAN), a portion of the Internet, a portion of the Public Switched Telephone Network (PSTN), a cellular telephone network, or a combination of two or more of these. Network 110 may include one or more networks 110.

Links 150 may connect client system 130, social-networking system 160, and third-party system 170 to communication network 110 or to each other. This disclosure contemplates any suitable links 150. In particular embodiments, one or more links 150 include one or more wireline (such as, for example, Digital Subscriber Line (DSL) or Data Over Cable Service Interface Specification (DOCSIS)), wireless (such as, for example, Wi-Fi or Worldwide Interoperability for Microwave Access (WiMAX)), or optical (such as, for example, Synchronous Optical Network (SONET) or Synchronous Digital Hierarchy (SDH)) links. In particular embodiments, one or more links 150 each include an ad hoc network, an intranet, an extranet, a VPN, a LAN, a WLAN, a WAN, a WWAN, a MAN, a portion of the Internet, a portion of the PSTN, a cellular technology-based network, a satellite communications technology-based network, another link 150, or a combination of two or more such links 150. Links 150 need not necessarily be the same throughout network environment 100. One or more first links 150 may differ in one or more respects from one or more second links 150.

FIG. 2 illustrates example social graph 200. In particular embodiments, social-networking system 160 may store one or more social graphs 200 in one or more data stores. In particular embodiments, social graph 200 may include multiple nodes, which may include multiple user nodes 202 or multiple concept nodes 204, and multiple edges 206 connecting the nodes. Example social graph 200 illustrated in FIG. 2 is shown, for didactic purposes, in a two-dimensional visual map representation. In particular embodiments, a social-networking system 160, client system 130, or third-party system 170 may access social graph 200 and related social-graph information for suitable applications. The nodes and edges of social graph 200 may be stored as data objects, for example, in a data store (such as a social-graph database). Such a data store may include one or more searchable or queryable indexes of nodes or edges of social graph 200.

In particular embodiments, a user node 202 may correspond to a user of social-networking system 160. As an example and not by way of limitation, a user may be an individual (human user), an entity (e.g., an enterprise, business, or third-party application), or a group (e.g., of individuals or entities) that interacts or communicates with or over social-networking system 160. In particular embodiments, when a user registers for an account with social-networking system 160, social-networking system 160 may create a user node 202 corresponding to the user, and store the user node 202 in one or more data stores. Users and user nodes 202 described herein may, where appropriate, refer to registered users and user nodes 202 associated with registered users. In addition or as an alternative, users and user nodes 202 described herein may, where appropriate, refer to users that have not registered with social-networking system 160. In particular embodiments, a user node 202 may be associated with information provided by a user or information gathered by various systems, including social-networking system 160. As an example and not by way of limitation, a user may provide his or her name, profile picture, contact information, birth date, sex, marital status, family status, employment, education background, preferences, interests, or other demographic information. In particular embodiments, a user node 202 may be associated with one or more data objects corresponding to information associated with a user. In particular embodiments, a user node 202 may correspond to one or more webpages.

In particular embodiments, a concept node 204 may correspond to a concept. As an example and not by way of limitation, a concept may correspond to a place (such as, for example, a movie theater, restaurant, landmark, or city); a website (such as, for example, a website associated with social-networking system 160 or a third-party website associated with a web-application server); an entity (such as, for example, a person, business, group, sports team, or celebrity); a resource (such as, for example, an audio file, video file, digital photo, text file, structured document, or application) which may be located within social-networking system 160 or on an external server, such as a web-application server; real or intellectual property (such as, for example, a sculpture, painting, movie, game, song, idea, photograph, or written work); a game; an activity; an idea or theory; an object in an augmented/virtual reality environment; another suitable concept; or two or more such concepts. A concept node 204 may be associated with information of a concept provided by a user or information gathered by various systems, including social-networking system 160. As an example and not by way of limitation, information of a concept may include a name or a title; one or more images (e.g., an image of the cover page of a book); a location (e.g., an address or a geographical location); a website (which may be associated with a URL); contact information (e.g., a phone number or an email address); other suitable concept information; or any suitable combination of such information. In particular embodiments, a concept node 204 may be associated with one or more data objects corresponding to information associated with concept node 204. In particular embodiments, a concept node 204 may correspond to one or more webpages.

In particular embodiments, a node in social graph 200 may represent or be represented by a webpage (which may be referred to as a “profile page”). Profile pages may be hosted by or accessible to social-networking system 160. Profile pages may also be hosted on third-party websites associated with a third-party system 170. As an example and not by way of limitation, a profile page corresponding to a particular external webpage may be the particular external webpage, and the profile page may correspond to a particular concept node 204. Profile pages may be viewable by all or a selected subset of other users. As an example and not by way of limitation, a user node 202 may have a corresponding user-profile page in which the corresponding user may add content, make declarations, or otherwise express himself or herself. As another example and not by way of limitation, a concept node 204 may have a corresponding concept-profile page in which one or more users may add content, make declarations, or express themselves, particularly in relation to the concept corresponding to concept node 204.

In particular embodiments, a concept node 204 may represent a third-party webpage or resource hosted by a third-party system 170. The third-party webpage or resource may include, among other elements, content, a selectable or other icon, or other interactable object (which may be implemented, for example, in JavaScript, AJAX, or PHP codes) representing an action or activity. As an example and not by way of limitation, a third-party webpage may include a selectable icon such as “like,” “check-in,” “eat,” “recommend,” or another suitable action or activity. A user viewing the third-party webpage may perform an action by selecting one of the icons (e.g., “check-in”), causing a client system 130 to send to social-networking system 160 a message indicating the user's action. In response to the message, social-networking system 160 may create an edge (e.g., a check-in-type edge) between a user node 202 corresponding to the user and a concept node 204 corresponding to the third-party webpage or resource and store edge 206 in one or more data stores.

In particular embodiments, a pair of nodes in social graph 200 may be connected to each other by one or more edges 206. An edge 206 connecting a pair of nodes may represent a relationship between the pair of nodes. In particular embodiments, an edge 206 may include or represent one or more data objects or attributes corresponding to the relationship between a pair of nodes. As an example and not by way of limitation, a first user may indicate that a second user is a “friend” of the first user. In response to this indication, social-networking system 160 may send a “friend request” to the second user. If the second user confirms the “friend request,” social-networking system 160 may create an edge 206 connecting the first user's user node 202 to the second user's user node 202 in social graph 200 and store edge 206 as social-graph information in one or more of data stores 164. In the example of FIG. 2, social graph 200 includes an edge 206 indicating a friend relation between user nodes 202 of user “A” and user “B” and an edge indicating a friend relation between user nodes 202 of user “C” and user “B.” Although this disclosure describes or illustrates particular edges 206 with particular attributes connecting particular user nodes 202, this disclosure contemplates any suitable edges 206 with any suitable attributes connecting user nodes 202. As an example and not by way of limitation, an edge 206 may represent a friendship, family relationship, business or employment relationship, fan relationship (including, e.g., liking, etc.), follower relationship, visitor relationship (including, e.g., accessing, viewing, checking-in, sharing, etc.), subscriber relationship, superior/subordinate relationship, reciprocal relationship, non-reciprocal relationship, another suitable type of relationship, or two or more such relationships. Moreover, although this disclosure generally describes nodes as being connected, this disclosure also describes users or concepts as being connected. Herein, references to users or concepts being connected may, where appropriate, refer to the nodes corresponding to those users or concepts being connected in social graph 200 by one or more edges 206.

In particular embodiments, an edge 206 between a user node 202 and a concept node 204 may represent a particular action or activity performed by a user associated with user node 202 toward a concept associated with a concept node 204. As an example and not by way of limitation, as illustrated in FIG. 2, a user may “like,” “attended,” “played,” “listened,” “cooked,” “worked at,” or “watched” a concept, each of which may correspond to an edge type or subtype. A concept-profile page corresponding to a concept node 204 may include, for example, a selectable “check in” icon (such as, for example, a clickable “check in” icon) or a selectable “add to favorites” icon. Similarly, after a user clicks one of these icons, social-networking system 160 may create a “favorite” edge or a “check in” edge in response to the corresponding user action. As another example and not by way of limitation, a user (user “C”) may listen to a particular song (“Imagine”) using a particular application (SPOTIFY, which is an online music application). In this case, social-networking system 160 may create a “listened” edge 206 and a “used” edge (as illustrated in FIG. 2) between user nodes 202 corresponding to the user and concept nodes 204 corresponding to the song and application to indicate that the user listened to the song and used the application. Moreover, social-networking system 160 may create a “played” edge 206 (as illustrated in FIG. 2) between concept nodes 204 corresponding to the song and the application to indicate that the particular song was played by the particular application. In this case, “played” edge 206 corresponds to an action performed by an external application (SPOTIFY) on an external audio file (the song “Imagine”). Although this disclosure describes particular edges 206 with particular attributes connecting user nodes 202 and concept nodes 204, this disclosure contemplates any suitable edges 206 with any suitable attributes connecting user nodes 202 and concept nodes 204. Moreover, although this disclosure describes edges between a user node 202 and a concept node 204 representing a single relationship, this disclosure contemplates edges between a user node 202 and a concept node 204 representing one or more relationships. As an example and not by way of limitation, an edge 206 may represent both that a user likes and has used a particular concept. Alternatively, another edge 206 may represent each type of relationship (or multiples of a single relationship) between a user node 202 and a concept node 204 (as illustrated in FIG. 2 between user node 202 for user “E” and concept node 204 for “SPOTIFY”).

In particular embodiments, social-networking system 160 may create an edge 206 between a user node 202 and a concept node 204 in social graph 200. As an example and not by way of limitation, a user viewing a concept-profile page (such as, for example, by using a web browser or a special-purpose application hosted by the user's client system 130) may indicate that he or she likes the concept represented by the concept node 204 by clicking or selecting a “Like” icon, which may cause the user's client system 130 to send to social-networking system 160 a message indicating the user's liking of the concept associated with the concept-profile page. In response to the message, social-networking system 160 may create an edge 206 between user node 202 associated with the user and concept node 204, as illustrated by “like” edge 206 between the user and concept node 204. In particular embodiments, social-networking system 160 may store an edge 206 in one or more data stores. In particular embodiments, an edge 206 may be automatically formed by social-networking system 160 in response to a particular user action. As an example and not by way of limitation, if a first user uploads a picture, watches a movie, or listens to a song, an edge 206 may be formed between user node 202 corresponding to the first user and concept nodes 204 corresponding to those concepts. Although this disclosure describes forming particular edges 206 in particular manners, this disclosure contemplates forming any suitable edges 206 in any suitable manner.
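For concreteness, the node-and-edge structure described above can be sketched in code. The following Python sketch is purely illustrative and not the implementation described in this disclosure; the class names, fields, and the record_action helper are hypothetical. It shows user nodes, concept nodes, and typed edges, with an edge formed automatically in response to a user action:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str

@dataclass
class UserNode(Node):
    profile: dict = field(default_factory=dict)   # e.g., name, birth date, interests

@dataclass
class ConceptNode(Node):
    concept_type: str = "concept"                 # e.g., place, website, song

@dataclass
class Edge:
    source: Node
    target: Node
    edge_type: str                                # e.g., "friend", "like", "listened"

class SocialGraph:
    def __init__(self):
        self.nodes: dict[str, Node] = {}
        self.edges: list[Edge] = []

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def record_action(self, user: UserNode, concept: ConceptNode, action: str) -> Edge:
        # Automatically form an edge in response to a user action,
        # e.g., liking a concept or listening to a song.
        edge = Edge(user, concept, action)
        self.edges.append(edge)
        return edge

# Usage: a user likes a concept, creating a "like" edge.
graph = SocialGraph()
alice = UserNode("u1", {"name": "Alice"})
song = ConceptNode("c1", "song")
graph.add_node(alice)
graph.add_node(song)
graph.record_action(alice, song, "like")
```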

FIG. 3 illustrates an example view of a vector space 300. Vector space 300 may also be referred to as a feature space or an embedding space. In particular embodiments, an object or an n-gram may be represented in a d-dimensional vector space, where d denotes any suitable number of dimensions. An object may represent data, such as audio data or video data. Although the vector space 300 is illustrated as a three-dimensional space, this is for illustrative purposes only, as the vector space 300 may be of any suitable dimension. In particular embodiments, an object may be represented in the vector space 300 as a feature vector. A feature vector may also be referred to as an embedding. Each vector may comprise coordinates corresponding to a particular point in the vector space 300 (i.e., the terminal point of the vector). As an example and not by way of limitation, feature vectors 310, 320, and 330 may be represented as points in the vector space 300, as illustrated in FIG. 3. An object may be mapped to a respective vector representation. As an example and not by way of limitation, objects t₁ and t₂ may be mapped to feature vectors $\vec{v_1}$ and $\vec{v_2}$ in the vector space 300, respectively, by applying a function $\vec{\pi}$. The function $\vec{\pi}$ may map objects to feature vectors by feature extraction, which may start from an initial set of measured data and build derived values (e.g., features). When an object has data that is either too large to be efficiently processed or comprises redundant data, the function $\vec{\pi}$ may map the object to a feature vector using a transformed reduced set of features (e.g., feature selection). A feature vector may comprise information related to the object. In particular embodiments, an object may be mapped to a feature vector based on one or more properties, attributes, or features of the object, relationships of the object with other objects, or any other suitable information associated with the object. As an example and not by way of limitation, an object comprising a video or an image may be mapped to a vector representation in the vector space 300 by using an algorithm to detect or isolate various desired portions or shapes of the object. Features of the feature vector may be based on information obtained from edge detection, corner detection, blob detection, ridge detection, scale-invariant feature transformation, edge direction, changing intensity, autocorrelation, motion detection, optical flow, thresholding, blob extraction, template matching, Hough transformation (e.g., lines, circles, ellipses, arbitrary shapes), or any other suitable information. As another example and not by way of limitation, an object comprising audio data may be mapped to a feature vector based on features such as a spectral slope, a tonality coefficient, an audio spectrum centroid, an audio spectrum envelope, a Mel-frequency cepstrum, or any other suitable information. In particular embodiments, an n-gram may be mapped to a feature vector by a dictionary trained to map text to a feature vector. As an example and not by way of limitation, a model, such as Word2vec, may be used to map an n-gram to a feature vector. In particular embodiments, feature vectors or embeddings may be robust to basic changes like text addition or changes to aspect ratio. In particular embodiments, social-networking system 160 may map objects of different modalities (e.g., visual, audio, text) each to a particular vector space using a separate function. In particular embodiments, social-networking system 160 may map objects of different modalities to the same vector space, using a function jointly trained to map one or more modalities (e.g., visual, audio, or text) to a feature vector. Although this disclosure describes representing a video-content object in a vector space in a particular manner, this disclosure contemplates representing a video-content object in a vector space in any suitable manner.
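As a concrete illustration of the mapping described above, the following Python sketch shows a toy function π that maps either an n-gram (via an embedding-table lookup, in the spirit of a Word2vec-style trained dictionary) or a raw numeric signal (via simple feature extraction) to a feature vector. The table values and feature choices are hypothetical placeholders, not the trained mapping this disclosure contemplates:

```python
import numpy as np

d = 4  # dimensionality of the vector space (illustrative)

# A toy "dictionary" trained to map text to feature vectors; values are placeholders.
EMBEDDING_TABLE = {
    "opera":    np.array([0.9, 0.1, 0.0, 0.2]),
    "birthday": np.array([0.1, 0.8, 0.3, 0.0]),
}

def pi(obj) -> np.ndarray:
    """Map an object (here: an n-gram or a raw numeric signal) to a
    feature vector in the d-dimensional vector space."""
    if isinstance(obj, str):
        # n-gram -> embedding lookup
        return EMBEDDING_TABLE.get(obj, np.zeros(d))
    # Numeric data -> a reduced set of derived features (feature extraction)
    signal = np.asarray(obj, dtype=float)
    return np.array([signal.mean(), signal.std(), signal.min(), signal.max()])

v1 = pi("opera")               # feature vector for an n-gram
v2 = pi([0.0, 0.5, 1.0, 0.5])  # feature vector for a raw signal
```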

In particular embodiments, social-networking system 160 may calculate a similarity metric of feature vectors in vector space 300. A similarity metric may be a cosine similarity, a Minkowski distance, a Mahalanobis distance, a Jaccard similarity coefficient, or any other suitable similarity metric. As an example and not by way of limitation, a similarity metric of $\vec{v_1}$ and $\vec{v_2}$ may be a cosine similarity

$\frac{\vec{v_1} \cdot \vec{v_2}}{\lVert \vec{v_1} \rVert \, \lVert \vec{v_2} \rVert}.$

As another example and not by way of limitation, a similarity metric of $\vec{v_1}$ and $\vec{v_2}$ may be a Euclidean distance $\lVert \vec{v_1} - \vec{v_2} \rVert$. A similarity metric of two feature vectors may represent how similar the two objects corresponding to the two feature vectors, respectively, are to one another, as measured by the distance between the two feature vectors in the vector space 300. As an example and not by way of limitation, feature vector 310 and feature vector 320 may correspond to video-content objects that are more similar to one another than the video-content objects corresponding to feature vector 310 and feature vector 330, based on the distance between the respective feature vectors. In particular embodiments, social-networking system 160 may determine a cluster of vector space 300. A cluster may be a set of one or more points corresponding to feature vectors of objects or n-grams in vector space 300, and the objects or n-grams whose feature vectors are in the cluster may belong to the same class or have some semantic relationship to one another. As an example and not by way of limitation, a cluster may correspond to sports-related content and another cluster may correspond to food-related content. Although this disclosure describes calculating similarity metrics in a particular manner, this disclosure contemplates calculating similarity metrics in any suitable manner.
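The two similarity metrics given above translate directly into code. A minimal NumPy sketch:

```python
import numpy as np

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    # (v1 . v2) / (||v1|| ||v2||), as in the formula above
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def euclidean_distance(v1: np.ndarray, v2: np.ndarray) -> float:
    # ||v1 - v2||
    return float(np.linalg.norm(v1 - v2))

a = np.array([1.0, 0.0, 0.0])
b = np.array([0.8, 0.6, 0.0])
print(cosine_similarity(a, b))   # closer to 1.0 means more similar
print(euclidean_distance(a, b))  # smaller means more similar
```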

More information on vector spaces, embeddings, feature vectors, and similarity metrics may be found in U.S. patent application Ser. No. 14/949,436, filed 23 Nov. 2015, U.S. patent application Ser. No. 14/981,413, filed 28 Dec. 2015, U.S. patent application Ser. No. 15/286,315, filed 5 Oct. 2016, and U.S. patent application Ser. No. 15/365,789, filed 30 Nov. 2016, each of which is incorporated by reference.


FIG. 4 illustrates an example video understanding engine 400. In particular embodiments, video understanding engine 400 may comprise a video-recognition module 410, a text-recognition module 420, and an audio-recognition module 430. In particular embodiments, video-recognition module 410 may be trained by machine learning to receive a feature vector representing a video-content object based on one or more frames of the video-content object and output a prediction about the video-content object. In particular embodiments, text-recognition module 420 may be trained by machine learning to receive a feature vector representing a video-content object based on text associated with the video-content object and output a prediction about the video-content object. In particular embodiments, audio-recognition module 430 may be trained by machine learning to receive a feature vector representing a video-content object based on one or more portions of audio of the video-content object and output a prediction about the video-content object. Although this disclosure may describe a particular video understanding engine, this disclosure contemplates any suitable video understanding engine.
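One way to picture the three recognition modules is as implementations of a common interface that maps a modality-specific feature vector to a prediction. The sketch below is hypothetical; the class names and the hard-coded predictions stand in for trained models:

```python
import numpy as np

class RecognitionModule:
    """Common interface: map a modality-specific feature vector to a
    prediction about the video-content object."""
    def predict(self, feature_vector: np.ndarray) -> dict:
        raise NotImplementedError

class VideoRecognitionModule(RecognitionModule):
    def predict(self, feature_vector):
        # A trained frame-based model would run here; output is canned.
        return {"label": "boxing", "score": 0.7}

class TextRecognitionModule(RecognitionModule):
    def predict(self, feature_vector):
        return {"label": "fight", "score": 0.6}

class AudioRecognitionModule(RecognitionModule):
    def predict(self, feature_vector):
        return {"label": "sporting event", "score": 0.5}

modules = [VideoRecognitionModule(), TextRecognitionModule(), AudioRecognitionModule()]
predictions = [m.predict(np.zeros(4)) for m in modules]
```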

In particular embodiments, a video-content object may comprise frames and audio. As an example and not by way of limitation, the video-content object may be a video file (e.g., MP4, WMV, AVI, etc.) comprising video data in a video format (e.g., VP9, HEVC/H.265, etc.) and audio data in an audio format (e.g., MP3, AAC, Vorbis, FLAC, Opus, etc.). In particular embodiments, the video-content object may be associated with text. As an example and not by way of limitation, the video-content object may be associated with metadata. The metadata may include information about the production of the video-content object (e.g., the date, location, or author of the video), descriptive information about the video-content object (e.g., a summary of the video-content object, identities of people depicted, background information about an event depicted, why the video-content object was created, etc.), information about the content type (e.g., news report, birthday party, live stream, etc.), keywords associated with the video-content object, technical information about the video-content object (e.g., format, file size, duration, etc.), a transcript of the video-content object, or any other suitable metadata. As another example and not by way of limitation, the video-content object may be posted on an online social network and have associated text such as a post or a comment. Although this disclosure may describe a particular video-content object, this disclosure contemplates any suitable video-content object.

In particular embodiments, social-networking system 160 may access a feature vector representing the video-content object based on one or more frames of the video-content object. As an example and not by way of limitation, the feature vector may be determined based on feature extraction of features of the one or more frames. As another example and not by way of limitation, the feature vector of the video-content object may be based on one or more feature vectors of the one or more frames (e.g., pooling the feature vectors). In particular embodiments, the video-content object may correspond to a node in a social graph of the social-networking system 160. In particular embodiments, the video-content object may be stored in a data store (e.g., a social-graph database) and social-networking system 160 may access the feature vector from the data store. In particular embodiments, the social-networking system 160 may access the feature vector by accessing the video-content object and mapping the video-content object to the feature vector. Although this disclosure may describe accessing a feature vector in a particular manner, this disclosure contemplates accessing a feature vector in any suitable manner.
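One plausible reading of "pooling the feature vectors" is mean pooling over per-frame embeddings. A minimal sketch, assuming per-frame feature vectors have already been computed by some frame-level model:

```python
import numpy as np

def video_feature_vector(frame_vectors: list[np.ndarray]) -> np.ndarray:
    """Pool per-frame feature vectors into a single vector representing
    the video-content object (here: simple mean pooling)."""
    return np.mean(np.stack(frame_vectors), axis=0)

frames = [np.array([0.9, 0.1]), np.array([0.7, 0.3]), np.array([0.8, 0.2])]
v_video = video_feature_vector(frames)   # array([0.8, 0.2])
```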

In particular embodiments, social-networking system 160 may access a feature vector representing the video-content object based on at least some of the text associated with the video-content object. In particular embodiments, the text associated with the video-content object may be a transcript of one or more portions of the audio, metadata associated with the video-content object, or a post by a user of the social-networking system associated with the video-content object. As an example and not by way of limitation, the video-content object may be posted on social-networking system 160, and social-networking system 160 may access a feature vector based on a comment associated with the video-content object posted on the online social network. As another example and not by way of limitation, social-networking system 160 may access a feature vector based on metadata associated with the video-content object that indicates that the video-content object was created by a particular user. In particular embodiments, a feature vector may be based on topic classification information associated with the video-content object. As an example and not by way of limitation, a video-content object may have an associated topic comprising the text “opera,” and a feature vector may be based on the text “opera.” In particular embodiments, social-networking system 160 may train a language module (e.g., by machine learning) based on the text. A feature vector may be based on the output of a language module. Although this disclosure may describe accessing a feature vector in a particular manner, this disclosure contemplates accessing a feature vector in any suitable manner.

In particular embodiments, social-networking system 160 may access a feature vector representing the video-content object based on one or more portions of the audio. As an example and not by way of limitation, the video-content object may comprise a video of a birthday party, and the feature vector may be based on a portion of audio where people sing “Happy Birthday to You.” As another example and not by way of limitation, the video-content object may comprise a video of a blackbird, and a feature vector may be based on a portion of audio where the blackbird vocalizes (i.e., its bird song). In particular embodiments, a feature vector may be based on audio analysis. As an example and not by way of limitation, the feature vector may be based on identification of speech, an identified language of speech, an identified sound (e.g., a dog barking), source separation (e.g., an identified number of speakers, separating different sound sources into separate audio tracks, etc.), or any other suitable information based on audio analysis. Although this disclosure may describe accessing a feature vector in a particular manner, this disclosure contemplates accessing a feature vector in any suitable manner.
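By way of illustration, a portion of audio might be summarized as a fixed-length feature vector using the Mel-frequency cepstrum, one of the audio features named in the FIG. 3 discussion. The sketch below uses the librosa library and averages coefficients over time; the file name is hypothetical, and this is one possible featurization rather than the disclosed system's method:

```python
import librosa
import numpy as np

def audio_feature_vector(path: str) -> np.ndarray:
    """Summarize a portion of audio as a fixed-length feature vector
    using Mel-frequency cepstral coefficients averaged over time."""
    y, sr = librosa.load(path, sr=None)                  # waveform, sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape: (20, n_frames)
    return mfcc.mean(axis=1)                             # one 20-dimensional vector

v_audio = audio_feature_vector("birthday_party.wav")     # hypothetical file
```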

In particular embodiments, input to fusion module 440 may comprise one or more predictions made by video-recognition module 410, text-recognition module 420, or audio-recognition module 430. As an example and not by way of limitation, a video-content object may depict a boxing match. Video-recognition module 410 may predict that the video-content object depicts boxing based on one or more frames of the video-content object (e.g., by extracting features, such as images of boxers wearing boxing gloves, the boxing ring, the referee, etc.). Text-recognition module 420 may predict that the video-content object depicts a fight based on text associated with the video-content object, such as “fight,” “punch,” or “knockout.” Audio-recognition module 430 may predict that the video-content object depicts a sporting event based on a portion of the audio, such as audio of the crowd cheering or commentary provided by sportscasters. Each of these predictions may be used as an input to fusion module 440. Fusion module 440 may output a prediction (e.g., that the video-content object depicts a boxing match) or a feature vector representing the video-content object. Although this disclosure may describe determining a feature vector in a particular manner, this disclosure contemplates determining a feature vector in any suitable manner.
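The prediction-level fusion just described can be sketched as a weighted vote over the modules' label scores. This is a simple stand-in for a trained fusion module, with hypothetical weights and the boxing-match predictions from the example above:

```python
def fuse_predictions(predictions: list[dict], weights: list[float]) -> dict:
    """Late fusion: combine per-module label scores into one prediction
    by a weighted vote over labels (a stand-in for a trained fusion module,
    which could also map related labels such as "fight" and "boxing"
    to a single context)."""
    scores: dict[str, float] = {}
    for pred, w in zip(predictions, weights):
        scores[pred["label"]] = scores.get(pred["label"], 0.0) + w * pred["score"]
    label = max(scores, key=scores.get)
    return {"label": label, "score": scores[label]}

fused = fuse_predictions(
    [{"label": "boxing", "score": 0.7},          # video-recognition output
     {"label": "fight", "score": 0.6},           # text-recognition output
     {"label": "sporting event", "score": 0.5}], # audio-recognition output
    weights=[0.5, 0.25, 0.25],
)
```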

In particular embodiments, one of video-recognition module 410, text-recognition module 420, and audio-recognition module 430 may generate a feature vector based on one or more outputs of another one of video-recognition module 410, text-recognition module 420, and audio-recognition module 430. As an example and not by way of limitation, audio-recognition module 430 may output a predicted transcript of a video-content object. This predicted transcript may comprise text and be used as an input to text-recognition module 420. As another example and not by way of limitation, video-recognition module 410 may generate an intermediate output prediction, and the intermediate output prediction may be used as an input to audio-recognition module 430. Although this disclosure may describe particular inputs and outputs, this disclosure contemplates any suitable inputs and outputs.

In particular embodiments, fusion module 440 may be trained to take as inputs one or more of the frames of a video-content object, text associated with the video-content object, and one or more portions of the audio of the video-content object, and to output a feature vector representing the video-content object based on a combination of the inputs. In particular embodiments, video understanding engine 400 may comprise a fusion module 440 but not video-recognition module 410, text-recognition module 420, or audio-recognition module 430. Additionally or alternatively, fusion module 440 may output a prediction about the video-content object. In particular embodiments, video understanding engine 400 may comprise one or more of video-recognition module 410, text-recognition module 420, audio-recognition module 430, or fusion module 440, configured in any suitable manner. Although this disclosure may describe a particular video understanding engine, this disclosure contemplates any suitable video understanding engine.

In particular embodiments, video understanding engine 400 may comprise a fusion module 440. Fusion module 440 may be trained by machine learning to make a prediction about the video-content object based on one or more frames of the video-content object, text associated with the video-content object, and one or more portions of audio of the video-content object. In particular embodiments, fusion module 440 may be trained by machine learning to determine a feature vector representing the video-content object based on a combination of a feature vector based on one or more frames of the video-content object, a feature vector based on text associated with the video-content object, and a feature vector based on one or more portions of audio of the video-content object. As an example and not by way of limitation, a video-content object may depict a birthday party. A feature vector based on one or more frames of the video-content object, which may be based on recognizing objects depicted in the frames, such as a birthday cake or party hats, may be input to fusion module 440. A feature vector based on text associated with the video-content object, such as posts on an online social network that include the text “Happy Birthday,” or the title of the video-content object, “My Birthday Party,” may be input into fusion module 440. A feature vector based on one or more portions of audio of the video-content object, such as the audio of a group of people depicted in the video-content object singing the Happy Birthday song, may be input into fusion module 440. Fusion module 440 may output a feature vector representing the video-content object based on a combination of the inputted feature vectors. Although this disclosure may describe determining a feature vector in a particular manner, this disclosure contemplates determining a feature vector in any suitable manner.
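Feature-level fusion of this kind is often implemented by concatenating the modality-specific vectors and projecting the result with a learned transformation. A minimal sketch, where the projection matrix W (randomly initialized here) stands in for parameters a trained fusion module 440 would learn:

```python
import numpy as np

def fuse_feature_vectors(v_frames: np.ndarray,
                         v_text: np.ndarray,
                         v_audio: np.ndarray,
                         W: np.ndarray) -> np.ndarray:
    """Combine modality-specific feature vectors into a single vector
    representing the video-content object: concatenate, then project
    with a learned matrix W (here W is an untrained placeholder)."""
    combined = np.concatenate([v_frames, v_text, v_audio])
    return W @ combined

# Illustrative shapes: 3-, 2-, and 2-dimensional inputs fused to 4 dimensions.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 7))
v = fuse_feature_vectors(np.ones(3), np.ones(2), np.ones(2), W)
```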

In particular embodiments, fusion module 440 may determine a context of the video-content object. Fusion module 440 may be trained by machine learning to determine the context based on a feature vector representing the video-content object, the feature vector being based on a combination of a feature vector based on one or more frames of the video-content object, a feature vector based on at least some of the text associated with the video-content object, and a feature vector based on one or more portions of audio of the video-content object. In particular embodiments, fusion module 440 may determine a context of the video-content object based on social-graph information based at least in part on one or more nodes or edges connected to the node corresponding to the video-content object. As an example and not by way of limitation, a video-content object may be posted on a user's page on an online social network. The user may have posted the video on her birthday, as determined by the user profile of the user. The context may be that the video-content object depicts the user's birthday party, as determined by a feature vector representing the video-content object and the social-graph information. In particular embodiments, determining a context of a video-content object may comprise recognizing a physical object (e.g., a book), identifying a particular physical object (e.g., the book “Oh, The Places You'll Go!” by Dr. Seuss), detecting a physical object, tracking a physical object, recognizing a pose (e.g., sitting, standing, etc.), recognizing a face, determining a topic (e.g., sports, politics, documentary, etc.), recognizing a scene (e.g., classroom, forest, etc.), recognizing an activity (e.g., throwing a ball, jogging, etc.), recognizing behavior (e.g., laughing, crying, etc.), or recognizing any other information associated with the video-content object. Although this disclosure may describe determining a context of a video-content object in a particular manner, this disclosure contemplates determining a context of a video-content object in any suitable manner.

In particular embodiments, social-networking system 160 may receive a request to access the video-content object from a client device of a user of the social-networking system. In particular embodiments, social-networking system 160 may generate a recommendation for a second video-content object based on the feature vector of the video-content object and a user profile for the user. As an example and not by way of limitation, a user may access a first video-content object depicting a review of a mobile phone. Fusion module 440 may determine a feature vector representing the first video-content object. Further, based on social-graph information, social-networking system 160 may determine that the user is age 23 and likes the company APPLE. Social-networking system 160 may generate a recommendation for a second video-content object that features the APPLE IPHONE based on a similarity metric between the feature vector representing the first video-content object and a feature vector representing the second video-content object, and based on determining that users between the ages of 18 and 25 tend to prefer APPLE IPHONEs to other mobile phones. Social-networking system 160 may send, to the user's client device, the recommendation for the second video-content object. Although this disclosure may describe recommending a video-content object in a particular manner, this disclosure contemplates recommending a video-content object in any suitable manner.
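One simple way to realize such a recommendation is to score candidate video-content objects by cosine similarity to the watched object's feature vector and add a user-profile signal. The sketch below is illustrative only; the candidate names and the profile_boost term (e.g., an 18-to-25 demographic affinity) are hypothetical:

```python
import numpy as np

def recommend(v_watched: np.ndarray,
              candidates: dict[str, np.ndarray],
              profile_boost: dict[str, float]) -> str:
    """Rank candidate video-content objects by cosine similarity to the
    watched object's feature vector, adjusted by a user-profile signal."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {vid: cos(v_watched, vec) + profile_boost.get(vid, 0.0)
              for vid, vec in candidates.items()}
    return max(scores, key=scores.get)

best = recommend(np.array([0.9, 0.1]),
                 {"iphone_review": np.array([0.8, 0.2]),
                  "cat_video": np.array([0.1, 0.9])},
                 {"iphone_review": 0.1})   # e.g., 18-to-25 affinity signal
```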

In particular embodiments, determining the context of the video-content object may comprise determining that the video-content object is inappropriate. As an example and not by way of limitation, fusion module 440 may output a prediction that a video-content object depicts nudity or sexual content, violent or graphic content, hateful content (e.g., promotes or condones violence against individuals or groups), fraudulent or misleading content (e.g., a pyramid scheme), harmful or dangerous content (e.g., encourages others to do harmful activities), threatening material, or material that violates copyright law. In particular embodiments, social-networking system 160 may remove a second video-content object based on determining that the video-content object and the second video-content object are similar based on the feature vector for the video-content object and a feature vector for the second video-content object. As an example and not by way of limitation, fusion module 440 may determine that a video-content object depicts violent content. Social-networking system 160 may determine that a second video-content object is similar to the video-content object based on a cosine similarity between a feature vector representing the video-content object and a feature vector representing the second video-content object. Based on determining that the second video-content object is similar to the video-content object, social-networking system 160 may remove the second video-content object. Although this disclosure may describe determining that a video-content object is inappropriate and removing a video-content object in a particular manner, this disclosure contemplates determining that a video-content object is inappropriate and removing a video-content object in any suitable manner.

In particular embodiments, social-networking system 160 may receive a query associated with the video-content object from a client device of a user of the social-networking system. The user may submit the query to the social-networking system 160 by, for example, selecting a query input or inputting text into a query field. A user of an online social network may search for information relating to a specific subject matter (e.g., users, concepts, external content or resources) by providing a short phrase describing the subject matter, often referred to as a “search query,” to a search engine. The query may be an unstructured text query and may comprise one or more text strings (which may include one or more n-grams). In general, a user may input any character string into a query field to search for content on the social-networking system 160 that matches the text query. The query may comprise a plurality of n-grams. As an example and not by way of limitation, the querying user may have inputted the query “cats afraid of cucumbers.” Although this disclosure describes receiving a query in a particular manner, this disclosure contemplates receiving a query in any suitable manner.

In particular embodiments, social-networking system 160 may identify one or more objects matching the query. Social-networking system 160 may search a data store (or, in particular, a social-graph database) to identify content matching the query. The search engine may conduct a search based on the query phrase using various search algorithms and generate search results that identify resources or content (e.g., user-profile interfaces, content-profile interfaces, or external resources) that are most likely to be related to the search query. Although this disclosure describes identifying objects matching a query in a particular manner, this disclosure contemplates identifying objects matching a query in any suitable manner.

In particular embodiments, social-networking system 160 may, for each identified object, access a feature vector representing the identified object. Social-networking system 160 may map objects to feature vectors by feature extraction, or access a cached feature vector for an object that has been previously mapped. In particular embodiments, social-networking system 160 may rank each identified object based on a similarity metric between the feature vector representing the video-content object and the feature vector representing the identified object. As an example and not by way of limitation, the similarity metric may be a cosine similarity between the feature vector representing the video-content object and the feature vector representing the identified object. An object may be ranked higher if the cosine similarity associated with the object is larger. In particular embodiments, social-networking system 160 may send, to the client system in response to the query, one or more search results corresponding to one or more of the identified objects, respectively, each identified object corresponding to a search result having a rank greater than a threshold rank. As an example and not by way of limitation, a threshold rank may be a static number (e.g., 0.8). As another example and not by way of limitation, a threshold rank may be determined such that a particular number of search results are sent to the user (e.g., the threshold rank may be determined such that 50 search results corresponding to the top-ranked identified objects have a rank greater than the threshold rank). Although this disclosure describes ranking and sending objects in a particular manner, this disclosure contemplates ranking and sending objects in any suitable manner.
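The ranking-and-threshold step might look as follows in code. Identified objects are sorted by cosine similarity to the video-content object's feature vector, then cut off either by a static threshold or by a count-based cutoff, matching the two threshold-rank examples above (function and parameter names are illustrative):

```python
import numpy as np

def rank_results(v_object: np.ndarray,
                 identified: dict[str, np.ndarray],
                 threshold: float = 0.8,
                 top_k: int | None = None) -> list[str]:
    """Rank identified objects by cosine similarity to the video-content
    object's feature vector, then cut off by a static threshold or by a
    count-based cutoff."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {obj_id: cos(v_object, vec) for obj_id, vec in identified.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    if top_k is not None:
        return ranked[:top_k]            # e.g., send the top 50 results
    return [obj_id for obj_id in ranked if scores[obj_id] > threshold]
```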

FIG. 5 illustrates an example method 500 for determining a context of a video-content object. The method may begin at step 510, where social-networking system 160 may access a first feature vector representing a video-content object corresponding to a node in a social graph of a social-networking system, wherein: the video-content object comprises frames and audio and is associated with text, the first feature vector is based on one or more of the frames of the video-content object, and the social graph comprises a plurality of nodes and edges connecting the nodes. At step 520, social-networking system 160 may access a second feature vector representing the video-content object, wherein the second feature vector is based on at least some of the text. At step 530, social-networking system 160 may access a third feature vector representing the video-content object, wherein the third feature vector is based on one or more portions of the audio. At step 540, social-networking system 160 may determine a fourth feature vector representing the video-content object, wherein the fourth feature vector is based on a combination of the first, second, and third feature vectors. At step 550, social-networking system 160 may determine a context of the video-content object based on the fourth feature vector and social-graph information based at least in part on one or more nodes or edges connected to the node corresponding to the video-content object. Particular embodiments may repeat one or more steps of the method of FIG. 5, where appropriate. Although this disclosure describes and illustrates particular steps of the method of FIG. 5 as occurring in a particular order, this disclosure contemplates any suitable steps of the method of FIG. 5 occurring in any suitable order. Moreover, although this disclosure describes and illustrates an example method for determining a context of a video-content object including the particular steps of the method of FIG. 5, this disclosure contemplates any suitable method for determining a context of a video-content object including any suitable steps, which may include all, some, or none of the steps of the method of FIG. 5, where appropriate. Furthermore, although this disclosure describes and illustrates particular components, devices, or systems carrying out particular steps of the method of FIG. 5, this disclosure contemplates any suitable combination of any suitable components, devices, or systems carrying out any suitable steps of the method of FIG. 5.
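Putting the steps together, method 500 can be sketched end to end. The sketch below is self-contained and illustrative only; the fusion rule (concatenate and project) and the final nearest-prototype step stand in for the trained fusion module 440 and its context determination, and all data is randomly generated:

```python
import numpy as np

def determine_context(v1, v2, v3, graph_prototypes, W):
    """Steps 510-550 of method 500 in miniature: given frame-, text-, and
    audio-based feature vectors (steps 510-530), fuse them by concatenation
    and projection (step 540), then pick the context whose prototype vector,
    derived here from social-graph information, is nearest (step 550)."""
    v4 = W @ np.concatenate([v1, v2, v3])          # step 540: fused vector
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(graph_prototypes, key=lambda ctx: cos(v4, graph_prototypes[ctx]))

rng = np.random.default_rng(1)
context = determine_context(
    rng.random(3), rng.random(2), rng.random(2),   # steps 510-530
    {"birthday party": rng.random(4), "boxing match": rng.random(4)},
    rng.standard_normal((4, 7)),
)
```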

FIG. 6 illustrates an example computer system 600. In particular embodiments, one or more computer systems 600 perform one or more steps of one or more methods described or illustrated herein. In particular embodiments, one or more computer systems 600 provide functionality described or illustrated herein. In particular embodiments, software running on one or more computer systems 600 performs one or more steps of one or more methods described or illustrated herein or provides functionality described or illustrated herein. Particular embodiments include one or more portions of one or more computer systems 600. Herein, reference to a computer system may encompass a computing device, and vice versa, where appropriate. Moreover, reference to a computer system may encompass one or more computer systems, where appropriate.

This disclosure contemplates any suitable number of computer systems 600. This disclosure contemplates computer system 600 taking any suitable physical form. As an example and not by way of limitation, computer system 600 may be an embedded computer system, a system-on-chip (SOC), a single-board computer system (SBC) (such as, for example, a computer-on-module (COM) or system-on-module (SOM)), a desktop computer system, a laptop or notebook computer system, an interactive kiosk, a mainframe, a mesh of computer systems, a mobile telephone, a personal digital assistant (PDA), a server, a tablet computer system, an augmented/virtual reality device, or a combination of two or more of these. Where appropriate, computer system 600 may include one or more computer systems 600; be unitary or distributed; span multiple locations; span multiple machines; span multiple data centers; or reside in a cloud, which may include one or more cloud components in one or more networks. Where appropriate, one or more computer systems 600 may perform without substantial spatial or temporal limitation one or more steps of one or more methods described or illustrated herein. As an example and not by way of limitation, one or more computer systems 600 may perform in real time or in batch mode one or more steps of one or more methods described or illustrated herein. One or more computer systems 600 may perform at different times or at different locations one or more steps of one or more methods described or illustrated herein, where appropriate.

In particular embodiments, computer system 600 includes a processor 602, memory 604, storage 606, an input/output (I/O) interface 608, a communication interface 610, and a bus 612. Although this disclosure describes and illustrates a particular computer system having a particular number of particular components in a particular arrangement, this disclosure contemplates any suitable computer system having any suitable number of any suitable components in any suitable arrangement.

In particular embodiments, processor 602 includes hardware for executing instructions, such as those making up a computer program. As an example and not by way of limitation, to execute instructions, processor 602 may retrieve (or fetch) the instructions from an internal register, an internal cache, memory 604, or storage 606; decode and execute them; and then write one or more results to an internal register, an internal cache, memory 604, or storage 606. In particular embodiments, processor 602 may include one or more internal caches for data, instructions, or addresses. This disclosure contemplates processor 602 including any suitable number of any suitable internal caches, where appropriate. As an example and not by way of limitation, processor 602 may include one or more instruction caches, one or more data caches, and one or more translation lookaside buffers (TLBs). Instructions in the instruction caches may be copies of instructions in memory 604 or storage 606, and the instruction caches may speed up retrieval of those instructions by processor 602. Data in the data caches may be copies of data in memory 604 or storage 606 for instructions executing at processor 602 to operate on; the results of previous instructions executed at processor 602 for access by subsequent instructions executing at processor 602 or for writing to memory 604 or storage 606; or other suitable data. The data caches may speed up read or write operations by processor 602. The TLBs may speed up virtual-address translation for processor 602. In particular embodiments, processor 602 may include one or more internal registers for data, instructions, or addresses. This disclosure contemplates processor 602 including any suitable number of any suitable internal registers, where appropriate. Where appropriate, processor 602 may include one or more arithmetic logic units (ALUs); be a multi-core processor; or include one or more processors 602. Although this disclosure describes and illustrates a particular processor, this disclosure contemplates any suitable processor.

In particular embodiments, memory 604 includes main memory for storing instructions for processor 602 to execute or data for processor 602 to operate on. As an example and not by way of limitation, computer system 600 may load instructions from storage 606 or another source (such as, for example, another computer system 600) to memory 604. Processor 602 may then load the instructions from memory 604 to an internal register or internal cache. To execute the instructions, processor 602 may retrieve the instructions from the internal register or internal cache and decode them. During or after execution of the instructions, processor 602 may write one or more results (which may be intermediate or final results) to the internal register or internal cache. Processor 602 may then write one or more of those results to memory 604. In particular embodiments, processor 602 executes only instructions in one or more internal registers or internal caches or in memory 604 (as opposed to storage 606 or elsewhere) and operates only on data in one or more internal registers or internal caches or in memory 604 (as opposed to storage 606 or elsewhere). One or more memory buses (which may each include an address bus and a data bus) may couple processor 602 to memory 604. Bus 612 may include one or more memory buses, as described below. In particular embodiments, one or more memory management units (MMUs) reside between processor 602 and memory 604 and facilitate accesses to memory 604 requested by processor 602. In particular embodiments, memory 604 includes random access memory (RAM). This RAM may be volatile memory, where appropriate. Where appropriate, this RAM may be dynamic RAM (DRAM) or static RAM (SRAM). Moreover, where appropriate, this RAM may be single-ported or multi-ported RAM. This disclosure contemplates any suitable RAM. Memory 604 may include one or more memories 604, where appropriate. Although this disclosure describes and illustrates particular memory, this disclosure contemplates any suitable memory.

In particular embodiments, storage 606 includes mass storage for data or instructions. As an example and not by way of limitation, storage 606 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disc, a magneto-optical disc, magnetic tape, or a Universal Serial Bus (USB) drive or a combination of two or more of these. Storage 606 may include removable or non-removable (or fixed) media, where appropriate. Storage 606 may be internal or external to computer system 600, where appropriate. In particular embodiments, storage 606 is non-volatile, solid-state memory. In particular embodiments, storage 606 includes read-only memory (ROM). Where appropriate, this ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), or flash memory or a combination of two or more of these. This disclosure contemplates mass storage 606 taking any suitable physical form. Storage 606 may include one or more storage control units facilitating communication between processor 602 and storage 606, where appropriate. Where appropriate, storage 606 may include one or more storages 606. Although this disclosure describes and illustrates particular storage, this disclosure contemplates any suitable storage.

In particular embodiments, I/O interface 608 includes hardware, software, or both, providing one or more interfaces for communication between computer system 600 and one or more I/O devices. Computer system 600 may include one or more of these I/O devices, where appropriate. One or more of these I/O devices may enable communication between a person and computer system 600. As an example and not by way of limitation, an I/O device may include a keyboard, keypad, microphone, monitor, mouse, printer, scanner, speaker, still camera, stylus, tablet, touch screen, trackball, video camera, another suitable I/O device or a combination of two or more of these. An I/O device may include one or more sensors. This disclosure contemplates any suitable I/O devices and any suitable I/O interfaces 608 for them. Where appropriate, I/O interface 608 may include one or more device or software drivers enabling processor 602 to drive one or more of these I/O devices. I/O interface 608 may include one or more I/O interfaces 608, where appropriate. Although this disclosure describes and illustrates a particular I/O interface, this disclosure contemplates any suitable I/O interface.

In particular embodiments, communication interface 610 includes hardware, software, or both providing one or more interfaces for communication (such as, for example, packet-based communication) between computer system 600 and one or more other computer systems 600 or one or more networks. As an example and not by way of limitation, communication interface 610 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as a WI-FI network. This disclosure contemplates any suitable network and any suitable communication interface 610 for it. As an example and not by way of limitation, computer system 600 may communicate with an ad hoc network, a personal area network (PAN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), or one or more portions of the Internet or a combination of two or more of these. One or more portions of one or more of these networks may be wired or wireless. As an example, computer system 600 may communicate with a wireless PAN (WPAN) (such as, for example, a BLUETOOTH WPAN), a WI-FI network, a WI-MAX network, a cellular telephone network (such as, for example, a Global System for Mobile Communications (GSM) network), or other suitable wireless network or a combination of two or more of these. Computer system 600 may include any suitable communication interface 610 for any of these networks, where appropriate. Communication interface 610 may include one or more communication interfaces 610, where appropriate. Although this disclosure describes and illustrates a particular communication interface, this disclosure contemplates any suitable communication interface.

In particular embodiments, bus 612 includes hardware, software, or both coupling components of computer system 600 to each other. As an example and not by way of limitation, bus 612 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a front-side bus (FSB), a HYPERTRANSPORT (HT) interconnect, an Industry Standard Architecture (ISA) bus, an INFINIBAND interconnect, a low-pin-count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a serial advanced technology attachment (SATA) bus, a Video Electronics Standards Association local (VLB) bus, or another suitable bus or a combination of two or more of these. Bus 612 may include one or more buses 612, where appropriate. Although this disclosure describes and illustrates a particular bus, this disclosure contemplates any suitable bus or interconnect.

Herein, a computer-readable non-transitory storage medium or media may include one or more semiconductor-based or other integrated circuits (ICs) (such as, for example, field-programmable gate arrays (FPGAs) or application-specific ICs (ASICs)), hard disk drives (HDDs), hybrid hard drives (HHDs), optical discs, optical disc drives (ODDs), magneto-optical discs, magneto-optical drives, floppy diskettes, floppy disk drives (FDDs), magnetic tapes, solid-state drives (SSDs), RAM-drives, SECURE DIGITAL cards or drives, any other suitable computer-readable non-transitory storage media, or any suitable combination of two or more of these, where appropriate. A computer-readable non-transitory storage medium may be volatile, non-volatile, or a combination of volatile and non-volatile, where appropriate.

Herein, “or” is inclusive and not exclusive, unless expressly indicated otherwise or indicated otherwise by context. Therefore, herein, “A or B” means “A, B, or both,” unless expressly indicated otherwise or indicated otherwise by context. Moreover, “and” is both joint and several, unless expressly indicated otherwise or indicated otherwise by context. Therefore, herein, “A and B” means “A and B, jointly or severally,” unless expressly indicated otherwise or indicated otherwise by context.

The scope of this disclosure encompasses all changes, substitutions, variations, alterations, and modifications to the example embodiments described or illustrated herein that a person having ordinary skill in the art would comprehend. The scope of this disclosure is not limited to the example embodiments described or illustrated herein. Moreover, although this disclosure describes and illustrates respective embodiments herein as including particular components, elements, features, functions, operations, or steps, any of these embodiments may include any combination or permutation of any of the components, elements, features, functions, operations, or steps described or illustrated anywhere herein that a person having ordinary skill in the art would comprehend. Furthermore, reference in the appended claims to an apparatus or system or a component of an apparatus or system being adapted to, arranged to, capable of, configured to, enabled to, operable to, or operative to perform a particular function encompasses that apparatus, system, or component, whether or not it or that particular function is activated, turned on, or unlocked, as long as that apparatus, system, or component is so adapted, arranged, capable, configured, enabled, operable, or operative. Additionally, although this disclosure describes or illustrates particular embodiments as providing particular advantages, particular embodiments may provide none, some, or all of these advantages.

What is claimed is:
1. A method comprising: by one or more computing devices, accessing a video-content object; by one or more computing devices, determining a first feature vector representing the video-content object using a first recognition module of a first type based on an object in the video-content object; by one or more computing devices, determining a second feature vector representing the video-content object using a second recognition module of a second type based on the first feature vector, wherein the first type is different from the second type; and by one or more computing devices, determining a context of the video-content object based on the second feature vector.
2. The method of claim 1, wherein: the first recognition module is an audio-recognition module; the first feature vector represents a predicted transcript of the video-content object, wherein the transcript comprises text; and the second recognition module is a text-recognition module.
3. The method of claim 1, wherein: the first recognition module is a video-recognition module and the second recognition module is a text-recognition module; the first recognition module is a video-recognition module and the second recognition module is an audio-recognition module; the first recognition module is a text-recognition module and the second recognition module is a video-recognition module; the first recognition module is a text-recognition module and the second recognition module is an audio-recognition module; the first recognition module is an audio-recognition module and the second recognition module is a video-recognition module; or the first recognition module is an audio-recognition module and the second recognition module is a text-recognition module.
4. The method of claim 1, wherein: the video-content object corresponds to a node in a social graph of a social-networking system; the social graph comprises a plurality of nodes and edges connecting the nodes; and the context of the video-content object is determined based on social-graph information based at least in part on one or more nodes or edges connected to the node corresponding to the video-content object, in addition to the second feature vector.
5. The method of claim 1, wherein: the video-content object comprises frames and audio and is associated with text; and the object in the video-content object is one of: one or more of the frames; one or more portions of the audio; or at least some of the text.
6. The method of claim 1, wherein: the first recognition module is a video-recognition module; the first feature vector represents an intermediate output prediction; and the second recognition module is an audio-recognition module.
7. The method of claim 1, further comprising: by one or more computing devices, determining a third feature vector representing the video-content object using a third recognition module of a third type based on at least one of the first feature vector and the second feature vector, wherein the third type is different from the first and second types; and by one or more computing devices, determining a context of the video-content object based on the third feature vector.
8. The method of claim 1, wherein determining the first feature vector comprises: extracting at least one feature from each frame of a first set of frames of the video-content object to generate a first set of feature vectors; and pooling two or more of the first set of feature vectors to generate the first feature vector.
9. One or more computer-readable non-transitory storage media embodying software that is operable when executed to: access a video-content object; determine a first feature vector representing the video-content object using a first recognition module of a first type based on an object in the video-content object; determine a second feature vector representing the video-content object using a second recognition module of a second type based on the first feature vector, wherein the first type is different from the second type; and determine a context of the video-content object based on the second feature vector.
10. The media of claim 9, wherein: the first recognition module is an audio-recognition module; the first feature vector represents a predicted transcript of the video-content object, wherein the transcript comprises text; and the second recognition module is a text-recognition module.
11. The media of claim 9, wherein: the first recognition module is a video-recognition module and the second recognition module is a text-recognition module; the first recognition module is a video-recognition module and the second recognition module is an audio-recognition module; the first recognition module is a text-recognition module and the second recognition module is a video-recognition module; the first recognition module is a text-recognition module and the second recognition module is an audio-recognition module; the first recognition module is an audio-recognition module and the second recognition module is a video-recognition module; or the first recognition module is an audio-recognition module and the second recognition module is a text-recognition module.
12. The media of claim 9, wherein: the video-content object corresponds to a node in a social graph of a social-networking system; the social graph comprises a plurality of nodes and edges connecting the nodes; and the context of the video-content object is determined based on social-graph information based at least in part on one or more nodes or edges connected to the node corresponding to the video-content object, in addition to the second feature vector.
13. The media of claim 9, wherein: the video-content object comprises frames and audio and is associated with text; and the object in the video-content object is one of: one or more of the frames; one or more portions of the audio; or at least some of the text.
14. The media of claim 9, wherein: the first recognition module is a video-recognition module; the first feature vector represents an intermediate output prediction; and the second recognition module is an audio-recognition module.
15. The media of claim 9, wherein the software is further operable when executed to: determine a third feature vector representing the video-content object using a third recognition module of a third type based on at least one of the first feature vector and the second feature vector, wherein the third type is different from the first and second types; and determine a context of the video-content object based on the third feature vector.
16. The media of claim 9, wherein the software is operable to determine the first feature vector by: extracting at least one feature from each frame of a first set of frames of the video-content object to generate a first set of feature vectors; and pooling two or more of the first set of feature vectors to generate the first feature vector.
17. A system comprising: one or more processors; and a memory coupled to the processors and comprising instructions operable when executed by the processors to cause the processors to: access a video-content object; determine a first feature vector representing the video-content object using a first recognition module of a first type based on an object in the video-content object; determine a second feature vector representing the video-content object using a second recognition module of a second type based on the first feature vector, wherein the first type is different from the second type; and determine a context of the video-content object based on the second feature vector.
18. The system of claim 17, wherein: the first recognition module is an audio-recognition module; the first feature vector represents a predicted transcript of the video-content object, wherein the transcript comprises text; and the second recognition module is a text-recognition module.
19. The system of claim 17, wherein: the first recognition module is a video-recognition module and the second recognition module is a text-recognition module; the first recognition module is a video-recognition module and the second recognition module is an audio-recognition module; the first recognition module is a text-recognition module and the second recognition module is a video-recognition module; the first recognition module is a text-recognition module and the second recognition module is an audio-recognition module; the first recognition module is an audio-recognition module and the second recognition module is a video-recognition module; or the first recognition module is an audio-recognition module and the second recognition module is a text-recognition module.
20. The system of claim 17, wherein: the video-content object corresponds to a node in a social graph of a social-networking system; the social graph comprises a plurality of nodes and edges connecting the nodes; and the context of the video-content object is determined based on social-graph information based at least in part on one or more nodes or edges connected to the node corresponding to the video-content object, in addition to the second feature vector.